[OTEL] Add OpenTelemetry observability support #285
Open
royischoss wants to merge 35 commits into mlrun:development from
…R_SPACE and MLRUN_MODEL_ENDPOINT_MONITORING__STORE_PREFIXES__MONITORING_APPLICATION plus removes MLRUN_MODEL_ENDPOINT_MONITORING__ENDPOINT_STORE_CONNECTION
# Conflicts:
#   charts/mlrun-ce/Chart.yaml
#   charts/mlrun-ce/README.md
#   charts/mlrun-ce/requirements.lock
#   charts/mlrun-ce/values.yaml
#   tests/kind-test.sh
royischoss commented Apr 9, 2026
…ion accordingly. add request and limit for crdReadinessJob and namespaceLabelJob
# Conflicts:
#   charts/mlrun-ce/Chart.yaml
#   charts/mlrun-ce/README.md
#   charts/mlrun-ce/requirements.lock
…, change naming for otel metrics using metadata.name fieldRef
… empty templates, kubectl image

- Move the hardcoded OTel collector pipeline config into values.yaml under opentelemetry.collector.config, so users can now override receivers, processors, and exporters without forking the chart. The Prometheus endpoint uses short DNS (prometheus-operated:9090), which removes namespace interpolation from the helper.
- Add opentelemetry.kubectlImage to values.yaml (defaults to bitnami/kubectl:latest) and reference it in both crd-readiness-job.yaml and namespace-label.yaml instead of a hardcoded tag.
- Fix namespace-label.yaml: replace indent with nindent for correct YAML formatting; change restartPolicy: Never to OnFailure so the job retries on transient failures.
- Delete the empty collector.yaml and instrumentation.yaml template files, which generated no resources and were misleading. Move their documentation comment into crd-readiness-job.yaml, where the actual CR creation happens.
- Replace the 50-line hardcoded collector manifest in _helpers.tpl with toYaml .Values.opentelemetry.collector.config | nindent 4.
royischoss commented Apr 20, 2026
…ilience, package.sh

Issue 2: Bounded retry in CR installer (High)
The until kubectl apply loop in otel-cr-installer.yaml had no exit condition. On permanent failure (crashed operator, image pull error) it would spin silently until the 300s hook-timeout killed it with a cryptic deadline error. Added a max_retries=30 counter (30 × 5s = 150s max) with a clear exit 1 and a pointer to the operator logs on exhaustion. Applied to both the Collector and Instrumentation CR apply loops.

Issue 3: Skip rollout restart on upgrade (Medium)
The otel-cr-installer hook runs on both post-install and post-upgrade. Previously it always restarted all pods labeled mlrun.io/otel=true, causing unnecessary Jupyter and Nuclio churn on every helm upgrade even when OTel was already running. Added an init container presence check: if pods already have the opentelemetry init container injected, the restart is skipped. Restarts now happen only on fresh installs (where pods started before the webhook was ready).

Issue 4: Document PYTHONPATH workaround (Medium)
The mlrun.api.extraEnvKeyValue.PYTHONPATH entry exists to prevent OTel's $(PYTHONPATH) expansion from resolving to an empty string. The ideal fix is adding instrumentation.opentelemetry.io/inject-python: "false" as a pod annotation to opt the MLRun API out of injection, but the upstream mlrun chart template hardcodes pod annotations and has no podAnnotations values key. Improved the comment to document this constraint clearly so the next reader doesn't re-investigate.

Issue 5: Document Prometheus OTLP receiver (Medium)
enableFeatures: [otlp-write-receiver] and --web.enable-otlp-receiver are always set on the Prometheus sub-chart regardless of opentelemetry.collector.enabled. There's no Helm-native way to conditionally configure sub-chart values. Added a comment explaining the always-on behavior and why the attack surface increase is negligible (the OTLP endpoint shares the already-unauthenticated port 9090 with the Prometheus query API).
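The bounded retry described under Issue 2 could be sketched roughly as follows. This is a minimal illustration, not the code in otel-cr-installer.yaml: the function name retry_bounded and its argument order are hypothetical, and it returns 1 instead of calling exit 1 so the caller decides how to fail.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command until it succeeds, giving up after
# max_retries attempts instead of looping forever.
#   $1 = max_retries, $2 = delay in seconds, remaining args = command to run
retry_bounded() {
  local max_retries="$1" delay="$2"
  shift 2
  local retries=0
  until "$@"; do
    retries=$((retries + 1))
    if [ "$retries" -ge "$max_retries" ]; then
      # Fail loudly with a hint, rather than waiting for the hook timeout.
      echo "ERROR: '$*' still failing after ${retries} attempts; check the operator logs" >&2
      return 1
    fi
    sleep "$delay"
  done
}
```

Assuming the CR has already been written to a file, a call such as `retry_bounded 30 5 kubectl apply -f /tmp/instrumentation-cr.yaml || exit 1` would mirror the 30-attempt, 5-second budget mentioned above.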
Issue 6: RBAC resources leak on uninstall (Medium)
All 5 RBAC resources (ClusterRole + ClusterRoleBinding for otel-crd-reader, ServiceAccount + Role + RoleBinding for otel-cr-creator) were annotated as Helm hooks with only the before-hook-creation delete policy. helm uninstall doesn't run hooks, so these resources, including the cluster-scoped ones, were never deleted. This was confirmed on the test cluster, where an old my-mlrun-otel-crd-reader from 11 days prior was still present. Removed all hook annotations; the resources are now regular Helm-managed resources and are deleted on uninstall. namespace-label.yaml moved from pre-install,pre-upgrade to post-install,post-upgrade (weight -10) so that the regular RBAC resources exist before the labeling job runs.

Issue 7: package.sh version hardcoded + CRD slimming broken (Medium)
Two bugs in tests/package.sh:
1. The OTel operator .tgz filename was hardcoded as opentelemetry-operator-0.78.1.tgz in two places, so version bumps would silently skip the schema patch. Changed to read the version dynamically from requirements.yaml via Python.
2. The CRD slimming step replaced conf/crds/ templates with empty stubs (crds.create: false). This broke when we switched to Option B (crds.create: true): the stubs rendered nothing, so CRDs were never created on a fresh packaged-chart install. Removed the slimming step entirely. The full CRD YAML (~1.6 MB) compresses to ~160 KB gzipped in the Helm release Secret, well within the 3 MB Kubernetes limit.

Bash heredoc-in-until fix (discovered during cluster validation)
The Instrumentation CR apply originally used until cat <<'EOF' | kubectl apply -f -; do, a heredoc inside an until condition. This is unreliable: the heredoc content is consumed on the first evaluation and subsequent retries pipe empty input to kubectl. Changed to write to /tmp/instrumentation-cr.yaml first (same pattern as the Collector CR), which is unconditionally safe across bash versions.
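For the first bug under Issue 7, the idea of deriving the operator version from requirements.yaml instead of hardcoding it can be illustrated as below. Note that the PR does this with Python; this sketch swaps in awk purely for illustration. The function name dep_version is hypothetical, and the parsing assumes each dependency entry starts with "- name:" and lists "version:" after it.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the version of one dependency from a Helm
# requirements file. Assumes entries of the form:
#   - name: <dependency>
#     version: <version>
dep_version() {
  local file="$1" dep="$2"
  awk -v dep="$dep" '
    # A new list entry: remember whether it is the dependency we want.
    $1 == "-" && $2 == "name:" { in_dep = ($3 == dep); next }
    # First version key inside the matching entry: strip quotes and print.
    in_dep && $1 == "version:" { gsub(/"/, "", $2); print $2; exit }
  ' "$file"
}
```

With a helper like this, the archive name could be built as `"opentelemetry-operator-$(dep_version charts/mlrun-ce/requirements.yaml opentelemetry-operator).tgz"`, assuming requirements.yaml sits next to the other chart files, so a version bump would not need a matching edit in tests/package.sh.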
Tests
Added 7 new test cases to tests/helm-template-test.sh that catch all of the above regressions locally before cluster install:
- RBAC: no helm.sh/hook annotations on any resource (catches hook leak)
- RBAC: no before-hook-creation delete policy (same)
- Namespace-label: uses post-install,post-upgrade, not pre-install (catches SA missing at hook time)
- CR installer: has max_retries, retries=0, and exit 1 (catches infinite loop)
- CR installer: uses collector-cr.yaml and instrumentation-cr.yaml temp files (catches heredoc-in-until)
- CR installer: has init container check and skip message (catches unnecessary upgrade restarts)
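A check like "no helm.sh/hook annotations on any RBAC resource" can be expressed as a grep over rendered output. The sketch below is an assumption about how such a test might be written, not the contents of tests/helm-template-test.sh; the helper name assert_absent is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: fail if a rendered manifest contains a forbidden string.
#   $1 = fixed string to search for, $2 = rendered manifest file, $3 = test label
assert_absent() {
  local pattern="$1" file="$2" label="$3"
  if grep -qF -- "$pattern" "$file"; then
    echo "FAIL: $label" >&2
    return 1
  fi
  echo "PASS: $label"
}
```

For example, after rendering the RBAC template to a file with helm template (the exact values needed to enable the OTel templates are not shown here), one could run `assert_absent 'helm.sh/hook' /tmp/rbac.yaml 'RBAC: no hook annotations'`.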
Adds OTel-based observability to MLRun CE with automatic Python instrumentation, deployment-mode metrics collection, and Prometheus integration.
https://iguazio.atlassian.net/browse/CEML-685
What's implemented
OTel operator sub-chart
opentelemetry-operator v0.78.1 added as an optional dependency
crds.create: true, so the sub-chart manages OTel CRDs directly; full CRD YAML (~1.6 MB) compresses to ~160 KB gzipped in the Helm release Secret, well within the 3 MB Kubernetes API limit
opentelemetry.io/inject=enabled, which avoids injecting into unrelated namespaces
templates/opentelemetry/
namespace-label.yaml — post-install,post-upgrade hook (weight -10) that labels and annotates the release namespace:
otel-cr-installer.yaml — post-install,post-upgrade hook (weight 10) that:
rbac.yaml — regular Helm-managed resources (no hook annotations): ServiceAccount, ClusterRole, ClusterRoleBinding, Role, RoleBinding for the installer job. Being regular resources means they are cleaned up on helm uninstall.
Metrics pipeline: push model (OTLP to Prometheus)
Instrumentation CR (Python auto-instrumentation)
Nuclio function pods
MLRun API — PYTHONPATH workaround
tests/package.sh
Admin / non-admin split
Generated with Claude Code